ABSTRACT



An experiment compared the tagging of two languages: Czech, 
a highly inflected language with a high degree of ambiguity, and English. For 
Czech, the corpus was one gathered in the 1970s at the Czechoslovak Academy 
of Sciences; for English, it was the Wall Street Journal corpus. Results 
indicate 81.53 percent accuracy for Czech and 96.83 percent accuracy for 
English, representing a higher level of accuracy than expected for Czech. 
Several simple improvements in the Czech tagging system were identified. 
1. Introduction 

Highly inflectional languages like Czech pose a special problem for 
morphological disambiguation (usually called tagging). For example, 
the ending -u is not only highly ambiguous but also carries complex 
information: it corresponds, e.g., to genitive singular for inanimate 
nouns, dative singular for animate nouns, accusative singular for 
feminine nouns, or first person singular present tense active participle for 
certain verbs. 

Given the success of statistical methods in different areas, including text 
tagging, we wanted to try them on Czech as well, a language whose main 
features include rich inflection with a high degree of ambiguity. 
Originally we expected a plainly negative result, with no more than about 
two thirds of the tags correct. However, as we show later, the results were 
better than we had expected. 

We used the same statistical approach to tag both the English and the 
Czech text. For English, we obtained results comparable with those 
presented in [Brill 1993] (who uses different methods). For Czech, the 
results are less satisfying than those for English. 



2. Data Used 

2.1 For Czech 

For training, we used the corpus collected at the beginning of the 1970s at 
the Czechoslovak Academy of Sciences. The corpus was originally 
hand-tagged, including lemmatization and syntactic tags. The complete 
size of the corpus is about 600k tokens. We had to do some cleaning and 
conversion, as we were interested in the words and tags only. 

2.2 For English 

For training, we used the Wall Street Journal corpus [Marcus, Santorini, 
Marcinkiewicz 1993]. We had to change the format of the WSJ data to prepare 
it for our tagging software. 
3. Tags 

3.1 Czech tags 

The original tag system (in the hand-tagged corpus) was too detailed to be 
used directly. We disregarded all the other information (lemmatization and 
syntactic tags) in the training data and used the traditional division into 
part-of-speech classes. Each class contains many tags, one for each 
combination of morphological categories. For a description of the tags for the 
part-of-speech classes see Table 1. The first letter represents the tag for the 
part-of-speech class and is followed by the morphological categories for the 
given class. We used special tags for sentence boundaries, punctuation and 
“unknown tag”. We used 1171 different tags in our experiment for Czech. 
They were manually derived from the training corpus. 



part of speech   tag   subclass        subtag and morphological categories

nouns            N                     gender number case
                       abbreviation    Z
adjectives       A                     gender number case degree negation
verbs            V     infinitive      T negation
                       transgressive   W number tense voice gender negation
                       common          person number voice tense mood gender negation
pronouns         P     personal        P person number case
                                       3 gender number case
                       possessive      R gender-of-the-possessive number-of-the-possessive
                                         case person gender number
                       svuj            S gender number case
                       se              E case
                       others          D gender number case negation
adverbs          O
conjunctions     S
numbers          C
prepositions     R
interjections    F
particles        K

Table 1 
For example: 

NMS1 (noun, masculine animate, singular, nominative) 

NNP7 (noun, neuter, plural, instrumental) 

VTA (verb, infinitive, affirmative) 

V3SAPOMA (verb, 3rd person, singular, active, present tense, indicative, 
masc. anim., affirmative) 

PP2P7 (personal pronoun, 2nd person, plural, instrumental) 

AFP32N (adjective, feminine, plural, dative, comparative, negative) 
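
The positional layout of these tags can be unpacked mechanically. The following Python sketch decodes only noun tags of the form N + gender + number + case. The value codes (M, F, N for gender, S and P for number, digits 1 to 7 for case in the traditional Czech case order) are assumptions inferred from the examples above, and decode_noun_tag is a hypothetical helper, not part of the original system.

```python
# Hypothetical decoder for noun tags of the form N + gender + number + case.
# The code tables are assumptions inferred from the examples in the text;
# the full inventory of values used in the corpus is not given there.
GENDER = {"M": "masculine animate", "F": "feminine", "N": "neuter"}
NUMBER = {"S": "singular", "P": "plural"}
# Traditional Czech case numbering (assumed): 1 = nominative ... 7 = instrumental.
CASE = {
    "1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
    "5": "vocative", "6": "locative", "7": "instrumental",
}


def decode_noun_tag(tag):
    """Return (part of speech, gender, number, case) for a 4-character noun tag."""
    if len(tag) != 4 or tag[0] != "N":
        raise ValueError(f"not a 4-character noun tag: {tag!r}")
    _, gender, number, case = tag
    return ("noun", GENDER[gender], NUMBER[number], CASE[case])


print(decode_noun_tag("NMS1"))
print(decode_noun_tag("NNP7"))
```

Shorter tags such as NZ (abbreviation) are rejected with a ValueError, and value codes outside the assumed tables raise a KeyError rather than being guessed.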



3.2 English tags 

We used the Penn Treebank tagset, which contains 36 part-of-speech tags 
and 12 other tags (for punctuation and the currency symbol). A detailed 
description is available in [Santorini 1990]. 



4. The algorithms 

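We have used Merialdo’s methods (described, e.g., in [Merialdo 1992]). The 
tagging procedure \phi selects a sequence of tags T for the sentence W: 

\[
\phi : W \rightarrow T = \phi(W).
\]

In this case the optimal tagging procedure is 

\[
\phi(W) = \arg\max_T \Pr(T \mid W)
        = \arg\max_T \Pr(T \mid W)\,\Pr(W)
        = \arg\max_T \Pr(W, T)
        = \arg\max_T \Pr(W \mid T)\,\Pr(T).
\]

The factor Pr(W) does not depend on T, so multiplying by it leaves the 
argmax unchanged. 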

Our implementation is based on generating the (W, T) pairs by a probabilistic 
model using approximations of the probability distributions Pr(W | T) and 
Pr(T). 

Pr(T) is based on tag bigrams, and Pr(W | T) is approximated as the 
product of Pr(w_i | t_i). The parameters were estimated by the usual maximum 
likelihood training method, i.e. we approximated them by the relative 
frequencies found in the training data, smoothed using the 
unigram frequencies and the uniform distribution. 
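
The model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the interpolation weights (lambdas, mu), the START symbol, the toy corpus and the extra vocabulary slot for unknown words are all hypothetical, and Viterbi search is used for the argmax although the text does not say which search procedure was used.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol used as bigram context


def train(tagged_sentences):
    """Collect the counts needed for relative-frequency estimates."""
    tag_count, prev_count = Counter(), Counter()
    bigram_count, emit_count = Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            tag_count[tag] += 1
            prev_count[prev] += 1           # how often prev occurs as a context
            bigram_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            vocab.add(word)
            prev = tag
    return tag_count, prev_count, bigram_count, emit_count, vocab


def make_model(counts, lambdas=(0.7, 0.2, 0.1), mu=0.9):
    """Build smoothed Pr(t | prev) and Pr(w | t); the weights are assumed values."""
    tag_count, prev_count, bigram_count, emit_count, vocab = counts
    tags = sorted(tag_count)
    total = sum(tag_count.values())
    l2, l1, l0 = lambdas

    def p_trans(prev, tag):
        # Bigram relative frequency, backed off to unigram and uniform.
        seen = prev_count[prev]
        bigram = bigram_count[(prev, tag)] / seen if seen else 0.0
        unigram = tag_count[tag] / total
        return l2 * bigram + l1 * unigram + l0 / len(tags)

    def p_emit(tag, word):
        # Relative frequency mixed with a uniform distribution over the
        # vocabulary plus one slot reserved for unknown words.
        relative = emit_count[(tag, word)] / tag_count[tag]
        return mu * relative + (1 - mu) / (len(vocab) + 1)

    return tags, p_trans, p_emit


def viterbi(words, tags, p_trans, p_emit):
    """Return the tag sequence maximising Pr(W | T) * Pr(T) under the bigram model."""
    if not words:
        return []
    # best[t] = (log-probability, path) of the best sequence ending in tag t
    best = {
        t: (math.log(p_trans(START, t)) + math.log(p_emit(t, words[0])), [t])
        for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                (best[p][0] + math.log(p_trans(p, t)), best[p][1]) for p in tags
            )
            new_best[t] = (score + math.log(p_emit(t, word)), path + [t])
        best = new_best
    return max(best.values())[1]


corpus = [
    [("the", "DT"), ("discussion", "NN"), ("followed", "VBD")],
    [("the", "DT"), ("lengthy", "JJ"), ("discussion", "NN")],
    [("Buffett", "NNP"), ("said", "VBD")],
]
tags, p_trans, p_emit = make_model(train(corpus))
print(viterbi(["the", "lengthy", "discussion", "followed"], tags, p_trans, p_emit))
# ['DT', 'JJ', 'NN', 'VBD']
```

Working in log space avoids numerical underflow on long sentences, and the uniform component keeps every probability strictly positive, so the logarithms are always defined.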
5. The results 





                                     Experiment for Czech   Experiment for English

corpus                               Czech hand-tagged      Wall Street Journal
training data (tokens)               621 015                1 287 749
training data (words)                72 445                 51 433
training data (tags)                 1 171                  45
training data (the average
  number of tags per token)          3.65                   2.34
test data (tokens)                   1 294                  1 294
incorrect tags                       239                    41
tagging accuracy                     81.53%                 96.83%



To illustrate the results of our tagging procedures, we present here an 
example from the tagged test text. Each line gives the tagged word, the 
hand-assigned tag and the result of the tagging programme. The cases of 
incorrect tag assignment are marked with an asterisk. 

tagged word | hand-assigned tag | result of the tagging programme 

Czech test text 

jmenem       | Rjmenem  | NNS7  *
uv           | NZ       | NZ
Ksc          | NZ       | NZ
pozdravil    | V3SAMOMA | NZ    *
Davisovou    | NFS4     | NZ    *
Pavel        | NMS1     | NMS1
Auersperg    | NMS1     | NMS1
W_SB         | T_SB     | T_SB
ucastnici    | NMP1     | NMP1
shromazdeni  | NNS2     | NNS2

English test text 

In           | IN       | IN
the          | DT       | DT
lengthy      | JJ       | JJ
discussion   | NN       | NN
that         | IN       | WDT   *
followed     | VBD      | VBD
.            | .        | .
Mr.          | NNP      | NNP
Buffett      | NNP      | NNP
said         | VBD      | VBD
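
The accuracy figures follow directly from comparing the two tag columns. As a small illustration in Python, the sketch below lists the mismatches in the English sample and recomputes the reported English accuracy from the token and error counts; the accuracy helper is a hypothetical name introduced here.

```python
# (word, hand-assigned tag, tagger output) for the English sample in the text.
sample = [
    ("In", "IN", "IN"), ("the", "DT", "DT"), ("lengthy", "JJ", "JJ"),
    ("discussion", "NN", "NN"), ("that", "IN", "WDT"), ("followed", "VBD", "VBD"),
    (".", ".", "."), ("Mr.", "NNP", "NNP"), ("Buffett", "NNP", "NNP"),
    ("said", "VBD", "VBD"),
]


def accuracy(n_tokens, n_incorrect):
    """Share of tokens whose assigned tag matches the hand-assigned tag."""
    return (n_tokens - n_incorrect) / n_tokens


# Rows where the tagger disagrees with the hand-assigned tag.
errors = [(word, gold, pred) for word, gold, pred in sample if gold != pred]
print(errors)                                       # [('that', 'IN', 'WDT')]
print(f"{accuracy(len(sample), len(errors)):.2%}")  # 90.00%

# English test set as reported: 1 294 tokens, 41 incorrect tags.
print(f"{accuracy(1294, 41):.2%}")                  # 96.83%
```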
6. Conclusion 

The results, although they might seem negative compared to English, are 
still better than our original expectations. We would like to improve the 
current approach with other simple measures. For example, the average number of 
tags per token will increase after a morphological analyser is added as the 
front end to the tagger (serving as the “supplier” of possible tags). We also 
plan to use trigrams instead of bigrams after we collect more data for Czech. 
Finally, certain tagset reductions should be carried out, as the original tagset 
(even after the reductions mentioned above) is too detailed (in the sense that it 
distinguishes tags hardly distinguishable by human annotators). We are also 
working on independent predictions for certain grammatical categories and 
the lemma itself, but the final shape of the model has not yet been decided. 
This would mean introducing constraints on possible combinations of 
morphological categories and taking them into account when “assembling” the 
final tag. 
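
One way such constrained assembly could look is sketched below in Python. Since the shape of the model is still open, this is purely illustrative: the function assemble_tag, the per-category probabilities and the list of allowed tags are all hypothetical, and scoring a tag by the product of independent category probabilities is only one possible choice.

```python
from math import prod


def assemble_tag(category_probs, allowed_tags):
    """Pick the allowed tag whose per-position values score highest jointly.

    category_probs[i] maps each value at tag position i to its predicted
    probability; values missing from a distribution count as probability 0.
    """
    def score(tag):
        return prod(dist.get(value, 0.0) for dist, value in zip(category_probs, tag))

    return max(allowed_tags, key=score)


# Hypothetical independent predictions for one token:
# part of speech, gender, number, case.
category_probs = [
    {"N": 1.0},
    {"F": 0.6, "N": 0.4},
    {"S": 0.7, "P": 0.3},
    {"4": 0.5, "7": 0.3, "2": 0.2},
]
# Taken independently, the best values would combine to NFS4; here the
# constraints are assumed to admit only these two tags for the word form.
allowed = ["NNS2", "NFS7"]
print(assemble_tag(category_probs, allowed))  # NFS7
```

The constraint set plays the role of a filter: combinations of categories that are never valid together are simply not offered to the scorer.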